One health is a relative new concept that verses about the integrated health between animals, humans and environment. Simply put it seems to be a application of ecology and evolution into health and well-being. It brings the ideia that all health are intercorrelated whicth is quite percetible in this pandemic situation that we are living right now. Having in mind that we, as humans, are generating a great evolution pressure in plants, animals and, especially, in micro-organisms. The last is the most susceptible organims to the evolution pressure, and this can be noticed by the widespread antibiotic resistance and frequent appearece of new superbugs every now and then. On top of that, humans already see the problems but do not understand it completely, so many cientists are working in this subject.
During one of our cientific meetings one proffessor brought up the fact that without previous information It is extremely difficult to understand the geographic or ambiental source of a micro-organism. This information is important to understand, for exemple if bacteria that is sampled in farms or rivers come to contact with humans.
Furthermore, I hypothesize that this information may be hidden into bacteria genome. Some bacteria might have specific traits that could specify where they live, but as everything in biology things are not that simple, surely this target data might be hidden under a complex interations beteewn genes that cannot be scrutinized by human logic. That is why we have got computers! Maybe inputing good data into Machine Learning models could bring up this interaction and moreover could bring great genetics insights about bacteria distribution throughtout the environment.
*The main objective is to find out bacteria origin in one of the sprectrums that nucleotide database give us.
*Secondarily find new insights about specific genes and specific niches may be interesting to study about
The current work was not based in no articles or studys. I read some abstracts and I have made some research on PubMed but I was not able to find any interesting reference or similar study, but It might exist. Futher digging is needed
–> Everything are being made in R language, I started to learn It in the beggining of the quarantine, at the same time that I have made a bioinformatics course. So I am pretty new at both, but this is what I was already able to do.
Getting from a online spreadsheet previous obtained data by downloading accession ids on NCBI nucleotide plataform
link <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vTgehkctLr41ea7BgkPw-jwXL5poeo0Ce5HRfnL8Ydw__tFeTCI_
sxYjkkVtcctErqhNAA2bj6Ans9J/pub?gid=1011092740&single=true&output=csv"
Accession_id <- read.csv(link) %>% as.tibble()
bacteria_data_big <- tibble()
Using rentrez to access summary data from each id and then processing this wanted data into a tibble
In regard to uid, I had problems with it because of some uid was the same for different bacteria, this must not be happen, so I used accession number
This for in loop can return error sometimes,moreover I will think in a way to resolve the problem
for (i in 1:length(Accession_id$ID)) {
print(paste0("ongoing", i ))
tryCatch({
accession_id = Accession_id[i,] %>%
as.character()
bacteria_summary <-
rentrez::entrez_summary("nucleotide", id= accession_id)
bacteria_subtype <-
bacteria_summary$subtype %>%
str_split("\\|")
bacteria_subname <-
bacteria_summary$subname %>%
str_split("\\|")
bacteria_data_lil <-
tibble(accession_id,
bacteria_subname,
bacteria_subtype)
bacteria_data_big <-
bind_rows(bacteria_data_big,
bacteria_data_lil)
}, error = function(e) {
print(paste0("Deu merda no indice:", i))
indice_erro <- c(indice_erro, i)
})
if (i == length(Accession_id$ID)) {print("complete")}
}
# Each row now has a list with the variable and the value of this variable, so first
# we unlist both, which give us a new accessible list
Moreover things run smoothly
lasting_tib <- tibble()
for (x in 1:length(Accession_id$ID)) {
print(paste0("ongoing ", x))
list_subtype = bacteria_data_big$bacteria_subtype[[x]] %>%
unlist()
list_subname = bacteria_data_big$bacteria_subname[[x]] %>%
unlist()
list_accession_id <- bacteria_data_big$accession_id[[x]]
#Now we expand our tibble putting all variables and values in two collumns
for (y in 1:length(list_subtype)) {
fleeting_tib <- tibble(ID= list_accession_id,
subtype= list_subtype[y],
subname= list_subname[y])
lasting_tib <- bind_rows(lasting_tib,
fleeting_tib)
print(paste0(y ," complete" ))
}
}
write.csv(lasting_tib,"lasting_tibble_ALL.csv")
#Because some error when writing infomartion on the NCBI plataform some collumns
#can come with missing values, this would generate problems in the following steps.
#Opening one of those bateria that had generated problems I was able to see that there was none
#so the problem must be in the program that have made the scraping
#So we must drop these blank values
lasting_tib %>%
mutate_all(na_if, "") %>% #summary of na values
is.na() %>%
summary()
lasting_tib %>% #summary of treated
mutate_all( na_if, "") %>%
drop_na() %>%
is.na() %>%
summary()
lasting_tib_treated <-
lasting_tib %>%
mutate_all( na_if, "") %>%
drop_na()
# At last using pivot wider we generate new collumns for the variables and complete
# them with value or NA
Here I eliminate all the Bacteria thet could generate a Problem
#finding duplicates
lasting_tib_treated <- lasting_tib_treated[,2:length(lasting_tib_treated)]
bacteria_dataset_lenght %>% summary()
#here you can see if is there any duplicates on the data set
#If the mean equal 1, everything is alright, but if not it means that
# one bacteria shows up more then once, for pragramic reasons I'll drop then
#Since I found problems only in host I made this resolution only for it, if the problem exists
#other variable you must do the same but with the other variable
ID2_drop <-
lasting_tib_treated %>%
filter( subtype == "host") %>%
mutate(rn = data.table::rowid(ID)) %>%
filter(rn >= 2) %>% select(1) %>% unlist()
lasting_tib_treated <- lasting_tib_treated %>% filter(!ID %in% ID2_drop)
# At last using pivot wider we generate new collumns for the variables and complete
# them with value or NA
bacteria_dataset <-
lasting_tib_treated %>%
pivot_wider(names_from = subtype,
values_from = subname)
write.csv(bacteria_dataset, "bacteria_dataset_ALL.csv" )
Target data still pretty crude, there are too many levels and these levels must be distribuited in few categories.
#Retriving data
analysis_data <-
read.csv("bacteria_dataset_ALL.csv")
#Analysis of data
analysis_data %>%
arrange() %>%
head()
The only data that I suppose to be wothy to study was host, isolation source and country. Therefore, next we have some data about then
analysis_data %>%
colnames()
## [1] "X" "ID" "strain"
## [4] "sub_strain" "country" "isolation_source"
## [7] "lat_lon" "collection_date" "note"
## [10] "host" "serotype" "pathovar"
## [13] "collected_by" "serovar" "culture_collection"
## [16] "isolate" "genotype" "old_name"
## [19] "serogroup" "subtype" "subgroup"
## [22] "old_lineage" "sub_species" "type_material"
## [25] "identified_by" "chromosome" "phenotype"
## [28] "altitude" "specimen_voucher" "dev_stage"
Next I’ve made a small treatement and tibbles that count the frequency of the values for the three variables. This should help to categorize the data in the future
country_levels <-
analysis_data$country %>%
as.factor() %>%
levels()
country_levels %>%
sample() %>%
head()
## [1] "Brazil: Jardinopolis"
## [2] "Dominican Republic: Santo Domingo"
## [3] "Japan: Yokosuka"
## [4] "France: Libourne"
## [5] "USA: Lahey Hospital & Medical Center"
## [6] "USA: MA"
analysis_data$country %>%
str_remove_all("\\:.*") %>%
as.tibble() %>%
na.omit() %>%
count(value) %>%
arrange(desc(n)) %>%
head()
analysis_data$country %>%
str_remove_all("\\:.*") %>%
na.omit() %>%
as.factor() %>%
levels()
## [1] "Afghanistan" "Algeria" "Antarctica"
## [4] "Argentina" "Australia" "Austria"
## [7] "Bahrain" "Bangladesh" "Belgium"
## [10] "Bolivia" "Brazil" "Brunei"
## [13] "Bulgaria" "Burkina Faso" "Cambodia"
## [16] "Cameroon" "Canada" "Chile"
## [19] "China" "Colombia" "Costa Rica"
## [22] "Croatia" "Cuba" "Czech Republic"
## [25] "Denmark" "Dominican Republic" "Ecuador"
## [28] "Egypt" "Estonia" "Finland"
## [31] "France" "Gabon" "Gambia"
## [34] "Georgia" "Germany" "Ghana"
## [37] "Greece" "Guam" "Guinea"
## [40] "Guinea-Bissau" "Guyana" "Hong Kong"
## [43] "Hungary" "India" "Indonesia"
## [46] "Iran" "Ireland" "Israel"
## [49] "Italy" "Japan" "Jordan"
## [52] "Kazakhstan" "Kenya" "Korea"
## [55] "Kosovo" "Laos" "Latvia"
## [58] "Lebanon" "Lithuania" "Malawi"
## [61] "Malaysia" "Mali" "Mexico"
## [64] "Morocco" "Mozambique" "Myanmar"
## [67] "Netherlands" "New Zealand" "Niger"
## [70] "Nigeria" "Norway" "Pakistan"
## [73] "Panama" "Papua New Guinea" "Paraguay"
## [76] "Peru" "Philippines" "Poland"
## [79] "Portugal" "Puerto Rico" "Qatar"
## [82] "Republic of the Congo" "Romania" "Russia"
## [85] "Rwanda" "Saudi Arabia" "Serbia"
## [88] "Singapore" "Slovakia" "Slovenia"
## [91] "South Africa" "South Korea" "Spain"
## [94] "Sri Lanka" "Sudan" "Sweden"
## [97] "Switzerland" "Taiwan" "Tajikistan"
## [100] "Tanzania" "Thailand" "Togo"
## [103] "Tonga" "Tunisia" "Turkey"
## [106] "Ukraine" "United Arab Emirates" "United Kingdom"
## [109] "Uruguay" "USA" "Viet Nam"
## [112] "Zaire" "Zambia"
hosts_levels <-
analysis_data$host %>%
as.factor() %>%
levels()
hosts_levels %>%
sample() %>%
head()
## [1] "swine" "Cygnus buccinator" "okapi"
## [4] "Cow" "Bos indicus" "Rat"
analysis_data$host %>%
na.omit() %>%
as.tibble() %>%
count(value) %>%
arrange(desc(n)) %>%
head()
isolation_source_levels <-
analysis_data$isolation_source %>%
as.factor() %>%
levels()
isolation_source_levels %>%
sample() %>%
head()
## [1] "CGC"
## [2] "Urine sample from patient with Septicemia"
## [3] "poultry fecal material"
## [4] "Sanger Centre via Imperial College"
## [5] "slaughterhouse"
## [6] "cow small intestine"
analysis_data$isolation_source %>%
str_replace_all("[sS]tool.*", "Stool") %>%
str_replace_all("[Ff]ece.*", "Feces") %>%
str_replace_all("wastewater.*", "wastewater") %>%
na.omit() %>%
as.tibble() %>%
count(value) %>%
arrange(desc(n)) %>%
head()
From a genetic approach this data could be things related to many aspects of bacteria genetics:
Well at first glance the more simple to do would be presence or of all genes, then the following steps are foucused in getting all genes from annotated genomes of NCBI database
bacteria_id <- analysis_data$ID
big_gene_tib <- tibble()#generating empty tibble
erro_indice <- tibble()
#Loop for that, based on bacteria ID, parse genbank files and then select genes and
#put them on a tibble. To parse the genome genbankr from bioconductor is used
tib_genome <- #function that produce a tibble with gene infomration and targets informations
function(genome) { #that was found using rentrez package
data_2bind <- analysis_data %>%
filter(
ID == bacteria_id[i]) %>%
select(
sub_strain,
country,
isolation_source,
host)
tibble(
data_2bind,
bacteria_ID = bacteria_id[i],
type = genome$type,
genes_composition = genome$gene,
locus_tag = genome$locus_tag,
loctype = genome$loctype,
gene_ID = genome$gene_id
)
}
checking_true <- 0
checking_false <- 0
for ( i in 1:length(bacteria_id)) {
print(paste0( i ,' accessing genome ', bacteria_id[i]))
tryCatch({
gene_accession <-
genbankr::GBAccession(bacteria_id[i])
parsing_genome <-
genbankr::readGenBank(gene_accession) #slower line to be processed
getting_genes <-
parsing_genome %>%
genbankr::genes()
}, error = function(e){
print(paste0("Deu merda no indice: ", i))
erro_indice_lil <- bacteria_id[i] %>% as.tibble()
erro_indice <- bind_rows(erro_indice, erro_indice_lil)
})
print(paste0("seeing if there's gene annotation in ", bacteria_id[i]))
#some genomes does not have annotation, therefore, there is two options, the first
#would be save only those that have it. the other would be make the annotations based
#on database. The last would take a really long time studying and processing the data
if (length(getting_genes) >= 1 ) {
print(paste0(bacteria_id[i], " checked"))
gene_data <- #here I select eligible variables and drop na values in genes
tib_genome(genome = getting_genes ) %>%
select(bacteria_ID,
country,
host,
isolation_source,
genes_composition) %>%
mutate(rn = data.table::rowid(genes_composition)) %>%
drop_na(genes_composition)
gene_data <- #here I eliminate all genes that are repeated
gene_data[!duplicated((gene_data$genes_composition)),]
checking_true <-
checking_true + 1
big_gene_tib <-
bind_rows(big_gene_tib,
gene_data)
} else {
print(paste0(bacteria_id[i] ," ", length(getting_genes) >= 1))
checking_false <- checking_false + 1
}
}
#check if data was correctly prospected is important. Then I sould work to find out how to
#do that
count_TF <- tibble(checking_false, checking_true)
expected_T <-
big_gene_tib$bacteria_ID %>%
unique() %>%
length()
expected_T == count_TF$checking_true
#saving brute data
write.csv(big_gene_tib, "big_gene_tib_ALL.csv", row.names = F)
This step is a complicated one, many problems happend throughtout the production of the script.
[1] "1167 accessing genome NZ_AKVX01000001.1"
Translation product seems to be missing for 53 of 4294 CDS annotations. Setting to ''
No exons read from genbank file. Assuming sections of CDS are full exons
No transcript features (mRNA) found, using spans of CDSs
[1] "seeing if there's gene annotation in NZ_AKVX01000001.1"
[1] "NZ_AKVX01000001.1 checked"
* No transcript features (mRNA) found, using spans of CDSs
Error in entrez_check(response) :
HTTP failure: 502, bad gateway. This error code is often returned when trying
to download many records in a single request.
Try using web history as described in the rentrez tutorial`
Anyway I was able to make this tibble with all test varibles and targets variable.
## [1] "Number of test variables"
## [1] 6455
## [1] "Number of bacteria"
## [1] 492
Further, having in mind that some annotations were lost in parsing I wanted to know the frequency of annotations in the genome. How many annotated genes there are in most of E. coli ?
big_gene_tib %>%
group_by(bacteria_ID) %>%
count() %>%
arrange(desc(n)) %>%
head()
-> The most largely annotated genome has 4566 genes
big_gene_tib %>%
group_by(bacteria_ID) %>%
count() %>%
arrange(n) %>%
head()
-> The least annotated genome has 26 gens
Counting_genomes <-
big_gene_tib %>%
group_by(bacteria_ID) %>%
count() %>%
arrange(desc(n))
Counting_genomes$n %>%
as.numeric() %>%
median()
## [1] 2795
Counting_genomes$n %>%
as.numeric() %>%
mean()
## [1] 2513.25
box <- Counting_genomes %>%
mutate(GENES_annotated = "GENES_annotated") %>%
ggplot(aes(y= n, x= GENES_annotated ))+
geom_boxplot() +
theme_minimal()
graph <- Counting_genomes %>% ggplot(aes(x = n)) +
geom_histogram(aes(y=..density..),binwidth = 15, alpha= .7) +
geom_density(fill="blue", alpha= .2)+
theme_minimal()
plotly::ggplotly(graph)
plotly::ggplotly(box)
Error in entrez_check(response) :
HTTP failure: 502, bad gateway. This error code is often returned when trying
to download many records in a single request.
Try using web history as described in the rentrez tutorial
[1] "1167 accessing genome NZ_AKVX01000001.1"
Translation product seems to be missing for 53 of 4294 CDS annotations. Setting to ''
No exons read from genbank file. Assuming sections of CDS are full exons
No transcript features (mRNA) found, using spans of CDSs
[1] "seeing if there's gene annotation in NZ_AKVX01000001.1"
[1] "NZ_AKVX01000001.1 checked"
No transcript features (mRNA) found, using spans of CDSs